Passive-Aggressive Sequence Labeling with Discriminative Post-Editing for Recognising Person Entities in Tweets
Abstract
Recognising entities in social media text is difficult. Named entity recognition (NER) on newswire text is conventionally cast as a sequence labeling problem, which makes implicit assumptions about the structure of the text. Social media text is rich in disfluency and often has poor or noisy structure, so intuitively it does not always satisfy these assumptions. We explore noise-tolerant methods for sequence labeling and apply discriminative post-editing to exceed state-of-the-art performance for person recognition in tweets, reaching an F1 of 84%.
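For readers unfamiliar with the passive-aggressive family named in the title, the following is a minimal sketch of the standard PA-I update for a binary classifier. It is not the authors' sequence labeler, which the abstract does not describe in detail; the function name `pa_update`, the plain-list feature vectors, and the aggressiveness parameter `C` are assumptions made for illustration only.

```python
def pa_update(w, x, y, C=1.0):
    """One PA-I step for a binary label y in {-1, +1}; returns new weights.

    Illustrative sketch only: w and x are plain lists of floats, and C
    caps the step size (both are assumptions, not taken from the paper).
    """
    # Signed margin of the current prediction on this example.
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    # Hinge loss: zero when the example is classified with margin >= 1.
    loss = max(0.0, 1.0 - margin)
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        # "Passive": the example is already satisfied, so leave w unchanged.
        return list(w)
    # "Aggressive": move just far enough to fix the margin, capped at C.
    tau = min(C, loss / sq_norm)
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


if __name__ == "__main__":
    w = pa_update([0.0, 0.0], [1.0, 0.0], 1)
    print(w)
```

The cap `C` limits how far a single example can move the weights, which is the usual reason this family is considered for noisy data: one mislabeled or malformed example cannot dominate the model.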
Similar resources
An Extended Study of Content and Crowdsourcing-related Performance Factors in Named Entity Annotation
Hybrid annotation techniques have emerged as a promising approach to carry out named entity recognition on noisy microposts. In this paper, we identify a set of content and crowdsourcing-related features (number and type of entities in a post, average length and sentiment of tweets, composition of skipped tweets, average time spent to complete the tasks, and interaction with the user interface)...
Online adaptation strategies for statistical machine translation in post-editing scenarios
One of the most promising approaches to machine translation consists in formulating the problem by means of a pattern recognition approach. By doing so, there are some tasks in which online adaptation is needed in order to adapt the system to changing scenarios. In the present work, we perform an exhaustive comparison of four online learning algorithms when combined with two adaptation strategi...
Recognizing Named Entities in Tweets
The challenges of Named Entities Recognition (NER) for tweets lie in the insufficient information in a tweet and the unavailability of training data. We propose to combine a K-Nearest Neighbors (KNN) classifier with a linear Conditional Random Fields (CRF) model under a semi-supervised learning framework to tackle these challenges. The KNN based classifier conducts pre-labeling to collect globa...
Understandability of Machine-translated Hindi Tweets Before and After Post-editing: Perspectives for a Recommender System
In the process of building a recommender system based on Hindi tweets for a project, we want to determine whether raw Machine Translation (MT) results could be useful. We collected 100K such tweets and experimented on 200 of them as a preliminary step. Less than 50% of the machine-translated tweets were understandable by English speakers, while at least 80% understandability seems to be require...
Learning Character Representations for Chinese Word Segmentation
We propose a simple yet effective semi-supervised method for improving Chinese Word Segmentation. Our method is based on learning generalizable vector and cluster representations of variable-length character sequences from large unlabeled data, which is then incorporated into a sequence labeling model with the passive-aggressive algorithm as features. We achieve state-of-the-art results on the ...